Real-time Text Analytics Pipeline Using Open-source Big Data Tools

Authors

  • Hassan Nazeer
  • Waheed Iqbal
  • Fawaz S. Bokhari
  • Faisal Bukhari
  • Shuja Ur Rehman Baig
Abstract

Real-time text processing systems are required in many domains to quickly identify patterns, trends, sentiments, and insights. Today, social networks, e-commerce stores, blogs, scientific experiments, and server logs are the main sources of large volumes of text data. Processing such data in real time requires a data processing pipeline, and the main challenge in building one is minimizing latency while handling high-throughput data. In this paper, we explain and evaluate our proposed real-time text processing pipeline, which uses open-source big data tools to minimize the latency of processing data streams. The pipeline is based on Apache Kafka for data ingestion, Apache Spark for in-memory data processing, Apache Cassandra for storing processed results, and the D3 JavaScript library for visualization. We evaluate the effectiveness of the proposed pipeline under varying deployment scenarios by performing sentiment analysis on a Twitter dataset. Our experimental evaluations show a latency of less than one minute when processing 466,700 tweets in 10.7 minutes with three virtual machines allocated to the proposed pipeline.

Keywords: Big Data Processing; Apache Spark; Apache Kafka; Real-time Text Processing; Sentiment Analysis.
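To make the ingest, process, and store stages of the abstract concrete, the following is a minimal single-process sketch in Python. It is not the authors' implementation: an in-memory queue stands in for a Kafka topic, a plain function stands in for the Spark job, a dictionary stands in for a Cassandra table, and the word lists and the `score`, `run_pipeline`, and `throughput` names are hypothetical. The abstract does not say which sentiment method the paper uses, so a toy lexicon count is assumed here. The last function only restates the reported figures as an average rate.

```python
import queue
import time

# Hypothetical toy lexicon; the abstract does not describe the real sentiment model.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def score(text):
    """Return (#positive words - #negative words) for one tweet."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def run_pipeline(tweets):
    """Push tweets through ingest -> process -> store and record per-tweet latency."""
    ingest = queue.Queue()  # stand-in for a Kafka topic
    store = {}              # stand-in for a Cassandra table

    # Ingestion stage: enqueue each tweet with its arrival timestamp.
    for tweet_id, text in enumerate(tweets):
        ingest.put((tweet_id, text, time.perf_counter()))

    # Processing stage: score each tweet, then write the result to the store.
    latencies = []
    while not ingest.empty():
        tweet_id, text, enqueued_at = ingest.get()
        store[tweet_id] = score(text)
        latencies.append(time.perf_counter() - enqueued_at)
    return store, latencies


def throughput(n_items, minutes):
    """Average items per second over a run of the given length."""
    return n_items / (minutes * 60)


if __name__ == "__main__":
    results, latencies = run_pipeline(["I love this", "awful service", "just a tweet"])
    print(results)
    print(f"max latency: {max(latencies):.6f} s")
    # Figures from the abstract: 466,700 tweets in 10.7 minutes.
    print(f"average rate: {throughput(466_700, 10.7):.0f} tweets/s")
```

Under these assumptions, the reported run corresponds to an average of roughly 727 tweets per second; the latency measured by the sketch says nothing about the distributed deployment evaluated in the paper.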

Similar resources

Real-Time Analysis of Students’ Activities on an

Real-time analytics is the capacity to extract valuable insights from data that comes continuously from activities on the web or from network sensors. It is largely used in web-based business to drive decisions based on user’s experiences, such as dynamic pricing and personalized advertising. Many universities have adopted web-based learning in their learning process. They use data-mining techniques t...

Big Data Analytics for Mass Casualty Incident (MCI) Situational Awareness

Introduction: A variety of big data analytics techniques and tools, including social media analytics, open source visualizations, statistical anomaly detection, use of Application Programming Interfaces (APIs), and geospatial mapping, are used for infectious disease biosurveillance. Using these methodologies, policy makers and practitioners detect and monitor outbreaks across the world near real...

Parallel and Distributed Data Pipelining with Knime

In recent years a new category of data analysis applications has evolved, known as data pipelining tools, which enable even nonexperts to perform complex analysis tasks on potentially huge amounts of data. Due to the complex and computing-intensive analysis processes and methods used, it is often neither sufficient nor possible to simply rely on the increase of performance of single processors...

A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection

Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied to big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not gone unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....

Design and Test of the Real-time Text mining dashboard for Twitter

One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in datasets that are currently being produced at high speed, in large volumes, and in a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing this huge amount of data has today become one of the concerns of data science scholar...

Journal title:
  • CoRR

Volume abs/1712.04344  Issue 

Pages  -

Publication date 2017